
    Dynamic Hardware Resource Management for Efficient Throughput Processing.

    High performance computing is evolving at a rapid pace, with throughput-oriented processors such as graphics processing units (GPUs) substituting for traditional processors as the computational workhorse. Their adoption has seen a tremendous increase as they provide high peak performance and energy efficiency while maintaining a friendly programming interface. Furthermore, many existing desktop, laptop, tablet, and smartphone systems support accelerating non-graphics, data parallel workloads on their GPUs. However, the multitude of systems that use GPUs as accelerators run different genres of data parallel applications with significantly contrasting runtime characteristics. GPUs use thousands of identical threads to efficiently exploit the on-chip hardware resources. Therefore, if one thread uses a resource (compute, bandwidth, data cache) more heavily, there will be significant contention for that resource. This contention eventually saturates the performance of the GPU at the bottleneck resource, leaving other resources underutilized. Traditional policies for managing these massive hardware resources work adequately on well-designed, traditional scientific-style applications. However, these static policies, which are oblivious to the application's resource requirements, are not efficient for the large spectrum of data parallel workloads with varying resource needs. Therefore, several standard hardware policies, such as running at maximum concurrency, using a fixed operating frequency, and round-robin scheduling, are not efficient for modern GPU applications. This thesis defines dynamic hardware resource management mechanisms that improve the efficiency of the GPU by regulating its hardware resources at runtime. The first step in achieving this goal is to make the hardware aware of the application's characteristics at runtime through novel counters and indicators. After this detection, dynamic hardware modulation provides opportunities for increased performance, improved energy consumption, or both, leading to efficient execution. The key mechanisms for modulating the hardware at runtime are dynamic frequency regulation, managing the amount of concurrency, managing the order of execution among different threads, and increasing cache utilization. The resulting increase in efficiency leads to improved energy consumption of systems that utilize GPUs while maintaining or improving their performance.
    PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/113356/1/asethia_1.pd
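
    The abstract names its mechanisms (runtime counters, concurrency management, frequency regulation) but not their exact form. The following C++ sketch is only a minimal illustration of the general idea under assumed counter names, thresholds, and frequency steps; it is not the thesis's actual design.

    // Hypothetical per-SM monitor: sample simple saturation counters each epoch
    // and use them to modulate active-warp count and core frequency.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    struct EpochCounters {
        uint64_t memStallCycles;  // cycles stalled waiting on memory (assumed counter)
        uint64_t activeCycles;    // cycles with at least one warp issuing
        uint64_t totalCycles;     // epoch length in cycles
    };

    struct SmConfig {
        int activeWarps;          // currently enabled concurrency
        int freqMHz;              // current core clock
    };

    // Assumed hardware limits (illustrative only).
    constexpr int kMaxWarps = 48, kMinWarps = 8;
    constexpr int kMaxFreq = 1000, kMinFreq = 600, kFreqStep = 50;

    void modulate(const EpochCounters& c, SmConfig& cfg) {
        double memStallRatio = double(c.memStallCycles) / double(c.totalCycles);
        double utilization   = double(c.activeCycles)   / double(c.totalCycles);

        if (memStallRatio > 0.5) {
            // Memory-saturated: extra warps only add contention and the core is
            // mostly waiting, so throttle concurrency and scale the clock down.
            cfg.activeWarps = std::max(kMinWarps, cfg.activeWarps / 2);
            cfg.freqMHz     = std::max(kMinFreq, cfg.freqMHz - kFreqStep);
        } else if (utilization > 0.9) {
            // Compute-bound and fully busy: restore concurrency and frequency.
            cfg.activeWarps = std::min(kMaxWarps, cfg.activeWarps + 4);
            cfg.freqMHz     = std::min(kMaxFreq, cfg.freqMHz + kFreqStep);
        }
    }

    int main() {
        SmConfig cfg{48, 1000};
        EpochCounters saturatedEpoch{7000, 3500, 10000};  // heavy memory stalls
        modulate(saturatedEpoch, cfg);
        std::printf("warps=%d freq=%dMHz\n", cfg.activeWarps, cfg.freqMHz);
    }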

    Mascar: Speeding up GPU Warps by Reducing Memory Pitstops

    With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed by memory intensive workloads are more scarce. Therefore, managing access to these limited memory resources is a challenge for GPUs. We propose Memory Aware Scheduling and Cache Access Re-execution (Mascar), a novel system for GPUs tailored for better performance on memory intensive workloads. This scheme detects memory saturation and prioritizes memory requests among warps to enable better overlapping of compute and memory accesses. Furthermore, it enables limited re-execution of memory instructions to eliminate structural hazards in the memory subsystem and to take advantage of cache locality in cases where requests cannot be sent to memory due to saturation. Our results show that Mascar provides a 34% speedup over the baseline round-robin scheduler and a 10% speedup over state-of-the-art warp schedulers for memory intensive workloads. Mascar also achieves an average of 12% energy savings for such workloads.
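
    The abstract describes prioritizing one warp's memory requests when the memory subsystem saturates. The C++ sketch below only illustrates that scheduling idea under assumed structure names and a simplified saturation signal; it is not the paper's hardware design and omits the cache access re-execution path.

    // Hypothetical issue-stage decision: under memory saturation, only the
    // designated "owner" warp may issue memory instructions so its data returns
    // sooner; other warps may still issue compute work.
    #include <cstdio>
    #include <optional>
    #include <vector>

    struct Warp {
        int id;
        bool nextIsMemOp;   // next instruction is a load/store
        bool ready;         // operands available, no hazards
    };

    struct MemorySubsystem {
        bool saturated;     // e.g., MSHRs or request queue backed up (assumed signal)
    };

    // Pick the next warp to issue this cycle. Under saturation, non-owner warps
    // are skipped if their next instruction is a memory operation.
    std::optional<int> pickWarp(const std::vector<Warp>& warps,
                                const MemorySubsystem& mem,
                                int ownerWarpId) {
        for (const Warp& w : warps) {
            if (!w.ready) continue;
            if (w.nextIsMemOp && mem.saturated && w.id != ownerWarpId)
                continue;  // hold back this warp's memory request for now
            return w.id;
        }
        return std::nullopt;  // nothing issuable this cycle
    }

    int main() {
        std::vector<Warp> warps = {{0, true, true}, {1, true, true}, {2, false, true}};
        MemorySubsystem mem{true};                        // memory is saturated
        auto pick = pickWarp(warps, mem, /*ownerWarpId=*/1);
        if (pick) std::printf("issue warp %d\n", *pick);  // prints "issue warp 1"
    }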